Breast Cancer Risk Assessment Tools for Stratifying Women into Risk Groups: A Systematic Review

Simple Summary Early detection of breast cancer in asymptomatic women through screening is an important strategy in reducing the burden of breast cancer. In current organized breast screening programs, age is the predominant risk factor. Breast cancer risk assessment tools are numerical models that can combine information on various risk factors to estimate the risk of being diagnosed with breast cancer within a certain time period. These tools could be used to offer risk-based screening. This systematic review assessed, using a variety of methods, how accurately breast cancer risk assessment tools can group women eligible for screening within a population, into risk groups, so that each group could potentially be offered a screening protocol with more benefits and less harms compared to current age-based screening. Abstract Background: The benefits and harms of breast screening may be better balanced through a risk-stratified approach. We conducted a systematic review assessing the accuracy of questionnaire-based risk assessment tools for this purpose. Methods: Population: asymptomatic women aged ≥40 years; Intervention: questionnaire-based risk assessment tool (incorporating breast density and polygenic risk where available); Comparison: different tool applied to the same population; Primary outcome: breast cancer incidence; Scope: external validation studies identified from databases including Medline and Embase (period 1 January 2008–20 July 2021). We assessed calibration (goodness-of-fit) between expected and observed cancers and compared observed cancer rates by risk group. Risk of bias was assessed with PROBAST. Results: Of 5124 records, 13 were included examining 11 tools across 15 cohorts. The Gail tool was most represented (n = 11), followed by Tyrer-Cuzick (n = 5), BRCAPRO and iCARE-Lit (n = 3). No tool was consistently well-calibrated across multiple studies and breast density or polygenic risk scores did not improve calibration. Most tools identified a risk group with higher rates of observed cancers, but few tools identified lower-risk groups across different settings. All tools demonstrated a high risk of bias. Conclusion: Some risk tools can identify groups of women at higher or lower breast cancer risk, but this is highly dependent on the setting and population.


Introduction
Early detection of breast cancer in asymptomatic women through screening is an important strategy in reducing the burden of breast cancer. Mammographic screening programs have decreased mortality for screened women and reduced the intensity of

Study Registration
Our Patient, Intervention, Comparison, Outcomes (PICO) question is 'For asymptomatic women aged ≥40 years, how accurately do different breast cancer risk assessment tools assign women to risk groups?', where the term 'risk assessment tool' is used synonymously for risk prediction tool, prognostic model, risk prediction model, risk model, and breast cancer prediction model. The protocol for this systematic review was registered on the International Prospective Register of Systematic Reviews (PROSPERO) as part of a larger protocol exploring breast cancer risk assessment tools (CRD42020159232). We followed the requirements of the PRISMA 2020 guidelines for conducting and reporting of systematic reviews [23].

Eligibility Criteria
The current analysis was confined to articles comparing breast cancer risk assessment tools on the same study cohort; cohorts had to consist of asymptomatic women undergoing population mammographic screening. We excluded articles limited to cohorts of women undergoing diagnostic breast imaging, specific ethnic groups or women with high risk of breast cancer as these represent sub-groups of the screened population. We considered only external validation studies (so that the study cohort was different from that used to develop each tool being compared), We included randomised controlled trials, paired cohort studies or systematic reviews thereof. Due to the need for sufficient follow-up between risk assessment and cancer outcomes, we included prospective or retrospective cohort studies (based on timing of risk predictor data collection in relation to outcome occurrence). All other study designs (such as cross-sectional studies or case-control studies) were excluded.
We included risk assessment tools based on questionnaire data with or without genetic and/or breast density information, where estimated future risk was projected to a minimum of two years (in line with the most common screening interval of most population breast cancer screening programs). Tools designed to be calibrated to the target population prior to use were included if they were developed on a different population to the study cohort. Tools requiring any non-standardised input (e.g., subjective assessment by a clinician) were excluded.
We restricted our analysis to articles published from 2008, aiming to include studies likely to use more relevant imaging methods and more recent versions of risk assessment tools while not excluding relatively contemporary studies with longer periods of follow-up. Only English language peer-reviewed publications were included; conference abstracts, reviews, letters, editorials and comments were excluded.
The primary outcome was expected versus observed incidence of invasive breast cancer (with or without in situ disease). Secondary outcomes were breast cancer mortality, incidence for different types of breast cancer as defined by characteristics such as tumour subtype, grade, size, nodal involvement, and interval breast cancers (i.e., cancers diagnosed following a negative screen and before any consecutive screens). Articles that did not report expected versus observed (E/O) calibration outcomes according to risk groups determined by the risk assessment tool were excluded.
Results were excluded from the analysis if risk was projected beyond the period for which the tool was developed. Five-year risk was the primary outcome compared and reported; results for 10-year risk are included in Supplementary Materials.
We contacted corresponding authors when there was a lack of clarity around criteria for inclusion in our review, allowing two weeks for a response, after which we sent a reminder in addition to contacting other authors on the paper. If no response was received, the study was excluded. Extracted data is presented in Supplementary Dataset S1.

Information Sources and Search Strategy
An experienced systematic reviewer (VF) searched on 1 July 2021 for English-language reports published from 1 January 2008 to 29 June 2021 on the following databases: (i) Ovid Medline and Embase; (ii) The Cochrane Database of Systematic Reviews (CDSR) and (iii) PROSPERO. An updated search until 20 July 2021 was also performed for these databases. For Ovid databases, database-specific subject headings and text terms were combined for breast cancer, risk assessment and calibration terms (see Supplementary Methods). The CDSR was searched by combining "breast cancer" and "risk" text terms. Reference lists of relevant systematic reviews and full-text articles were also scanned for additional potentially relevant reports by two systematic reviewers (VF, DC). The search strategy is presented in supplementary Table S1.

Selection Process
Titles and abstracts of the articles identified via the literature searches were screened against pre-specified inclusion criteria and split equally between two reviewers (VF, DC) with 20% assessed by both reviewers. The two reviewers independently assessed full-text articles of potential or unclear relevance for inclusion using a form with pre-specified selection criteria. Reviewers were not blinded to journal titles or study authors/institutions. Disagreements were resolved by discussion or adjudication by a third reviewer (SH).

Data Collection
Two independent reviewers (VF, DC) equally split the extraction of pre-determined study characteristics and results data from each included study and then reviewed the other's extractions for accuracy. Disagreements were resolved by discussion or adjudication by a third reviewer (SH, LV or CN); experienced statisticians were consulted to advise upon or review article methodology or calculations (SE or CN).
The following information was extracted: first author, publication year, country, study design, setting, study start, participant inclusion/exclusion criteria, screening protocol, population characteristics, risk assessment tool information, follow-up duration, risk prediction interval, reported relevant outcomes, E/O estimates and 95% confidence intervals (CIs), observed rates (or if missing, the observed number of breast cancers and number of women in each risk category) and other relevant information (including methods used, factors potentially affecting risk of bias). If E/O ratios, their 95% CIs or data for observed rates were not reported, these were calculated by the systematic reviewers from available data or plots where possible (VF, DC, SE). Ninety-five percent CIs were calculated using the following formula: E/O × expˆ(±1.96 × sqrt(1/O)) [23,24]. If there was insufficient data to perform calculations, authors were contacted and if attempts to obtain data were unsuccessful, the tool or study was excluded. In addition, where a tool version remained unclear after contacting authors and major updates to risk predictors had occurred between versions, the tool was excluded. It should be noted that risk predictors may be identified as risk factors, covariates, risk indicators, prognostic factors, determinants or independent variables [24].
We also identified high, moderate and low risk groups for each tool in each cohort. These groups were dependent on the number of quantiles the cohort of interest was divided into and whether they had the equivalent number of participants in each one. In general, when the cohort was divided in equal quartiles or deciles, we assumed the high-risk group corresponded to quartile 5 or deciles 9 and 10, the low-risk group corresponded to quartile 1 or deciles 1 and 2 while moderate-risk groups correspond to the remaining quantiles (quartiles 2-4 or deciles 3-8).

Metrics for Evaluating Risk Assessment Tools and Statistical Analysis
Prior to analysis, risk assessment tool comparisons were grouped by comparator tool (which could be any version of that tool). Data was extracted into Microsoft Excel and then plotted for each tool, age range and predicted year of risk.
We generated various data presentations and metrics to help evaluate and compare studies, as follows: A. Goodness of fit between expected (predicted) and observed outcomes: 1.
Plotted ratios of expected versus observed cancers, by population percentile. The E/O ratio (in log10 scale) with 95% confidence intervals were plotted according to risk group assignment using the mid-point percentile of each risk group in the study population. This facilitated standardisation of comparisons between tools that had a different number of risk groups and/or assigned different proportions of women to each risk group.

2.
The total number of women in each study cohort in risk groups for which the E/O 95%CIs included unity. This helped indicate the proportion of each study cohort that was well-validated by the tool, noting that this is more likely for smaller studies (and therefore wider CIs).

3.
Calibration belt goodness-of-fit tests. We assessed goodness of fit between expected (predicted) probabilities of developing breast cancer and observed data using calibration belts [25] as applied in Li et al., 2021 [26], where a p-value <0.05 indicated miscalibration by the tool [25].
B. Analysis of observed outcomes by risk group classification: 1.
Observed cancer rates (number of breast cancers divided by the number of women per 10,000 for each risk category), by mid-point percentile of each risk group in the study population. This helped to standardise comparisons.

2.
Characterisation of the functional form (curve) of observed cancer incidence rates according to increasing risk group, classified as either: 'increasing' (observed rates consistently increasing across risk categories), 'monotonic' (i.e., increasing or remaining steady across groups) or 'fluctuating' (all other options).

3.
Assessment of whether highest-risk women could be distinguished from women at more moderate-risk. We compared the observed breast cancer rate corresponding to the mid-range risk groups (usually quintiles 2-4 or deciles 3-8) with the highest risk group (quintile 5 or deciles 9-10). p-values <0.05 indicated a statistically significant difference and, therefore, good allocation of women to the highest risk group. To ensure comparability of findings, if >25% of the study cohort was allocated to the highest risk groups, p-values were reported but not taken into consideration when drawing conclusions regarding a particular tool. Consequently, mid-range risk groups would be expected to include ≥50% of the study cohort.

4.
Assessment of whether lowest-risk women could be distinguished from women at more moderate-risk. As for (3 above), but for the lowest risk group (quintile 1 or deciles 1-2 or the equivalent sub-groups representing ≤25% of cohort), compared to the remainder (quintiles 2-4 or deciles 3-8, or equivalent subgroups representing ≤50% of the cohort). To ensure comparability of findings, if >25% of the study cohort was allocated in the lowest risk groups, p-values were reported but not taken into consideration when drawing conclusions regarding a particular tool.
Plots and all statistical analyses were conducted using STATA (version 17, Stata Corporation, College Station, TX, USA).

Risk of Bias Assessment
Two independent reviewers (DC, VF) assessed the risk of bias for each included study. Differences were resolved by consensus or adjudication from a third reviewer (JS). Risk of bias was assessed using the 'Prediction model Risk Of Bias ASsessment Tool' (PROBAST), specifically designed to assess the Risk of Bias for, and the applicability of, diagnostic and prognostic prediction model studies [24]. PROBAST is organised into four domains; (i) participants (assessing suitable data sources or study designs and appropriate inclusions or exclusions), (ii) predictors (assessing predictor definition and measurements, knowledge of outcome influencing predictor assessment and whether the tool is used as designed if predictors are missing at time of validation), (iii) outcome (assessing methods used to classify participants with or without outcome, pre-specified/standard definition of outcome used, predictor exclusion from outcome definition, similar definition and determination of outcome for all participants, knowledge of predictor influencing outcome assessment, time interval between predictor assessment and outcome determination), and (iv) analysis (assessing reasonable number of participants with outcome, handling of continuous and categorical predictors, enrolled participant inclusion in analysis, handling of participants with missing data, handling of data complexities, evaluation of relevant tool performance measures). Each domain contains signalling questions to facilitate a structured judgement of risk of bias; the overall rating for a domain can be classified as either "low", "high" or "unclear" risk of bias. Each study is also allocated an overall risk of bias rating: "low", if no relevant shortcomings were identified in the risk of bias assessment; "high", if at least one domain was assessed as high risk of bias and "unclear" if risk of bias was assessed as unclear for at least one domain (and no other domains assessed as high risk of bias).
For each study, a separate risk of bias assessment was conducted for each distinct risk assessment tool validated, for each individual outcome and each cohort included [24]. Outcomes with multiple time points (e.g., 5-and 10-year risk predictions) were assessed separately because ratings for signalling questions on appropriate time interval between predictor assessment and outcome determination, and reasonable number of participants with outcome, could differ. As such, it was possible for a single study to have multiple overall risk of bias assessments.
Rulings were developed where necessary to account for judgements that required topic-specific knowledge or statistical expertise. These rulings were initially trialled independently over several studies by the same two reviewers (DC, VF) with third reviewer input from a senior researcher (JS) where required. It was decided a priori that: (i) risk of bias domains that contained signalling items relating only to model development would be omitted as the primary interest of this systematic review was risk assessment tool validation and (ii) the applicability of a study would not be formally assessed by the PROBAST tool; instead, concerns would be highlighted where necessary in the discussion. We sought statistical advice to develop rulings for items in the analysis domain as suggested by PROBAST. When assessing the reasonable number of participants with outcome PROBAST recommends that validation studies should include at least 100 participants with outcomes. After consulting statistical experts (SE, DO'C), it was decided a priori that a study would qualify for a low risk of bias rating if this was the case for every risk category. Where data for observed incidence of breast cancer per risk category was provided by authors or calculated by reviewers from calibration figures, this was used to inform our ratings. Otherwise, risk of bias was appraised based on the information reported in the article and included references. For the handling of missing data, based on methodological advice (QL), it was decided a priori that a study performing multiple imputation would qualify for low risk only if <50% of values were originally missing (and thus imputed) for a predictor and the missing data were missing at random [27,28]. Figure 1 summarises the search process conducted. The search strategy identified a total of 5114 records of which 3405 remained after duplicates were removed. Of these, 3324 records were excluded based on title and abstract review. Full texts or records of 91 potentially relevant reports were assessed according to the eligibility criteria. This included 10 additional articles identified from citation searching of full text articles and 1 potentially eligible article from the update search conducted on the 20 July 2021. A total of 78 reports were found not to be relevant and therefore excluded. We contacted authors to confirm the eligibility of two tools for one validation cohort [29]. Common reasons for exclusion included ineligible study design, ineligible population and E/O not reported by risk category. Further details on reasons for exclusions (studies and tools) and information regarding authors contacted are details in a Supplementary List S1.  Figure 1 summarises the search process conducted. The search strategy identified a total of 5114 records of which 3405 remained after duplicates were removed. Of these, 3324 records were excluded based on title and abstract review. Full texts or records of 91 potentially relevant reports were assessed according to the eligibility criteria. This included 10 additional articles identified from citation searching of full text articles and 1 potentially eligible article from the update search conducted on the 20 July 2021. A total of 78 reports were found not to be relevant and therefore excluded. We contacted authors to confirm the eligibility of two tools for one validation cohort [29]. Common reasons for exclusion included ineligible study design, ineligible population and E/O not reported by risk category. Further details on reasons for exclusions (studies and tools) and information regarding authors contacted are details in a Supplementary List S1. The remaining 13 articles included in this review examined the prediction of breast cancer across 15 cohort studies applying 11 distinct breast cancer risk assessment tools of different versions. Summary characteristics of included articles are presented in Table 1. All studies were prospective in design apart from one retrospective study [30]. Ten of the 13 articles were from North America and Europe and compared more than two risk assessment tools based on a 5-year risk prediction interval. Only two articles presented findings for 5-and 10-year tool-determined risk [31,32]. The remaining 13 articles included in this review examined the prediction of breast cancer across 15 cohort studies applying 11 distinct breast cancer risk assessment tools of different versions. Summary characteristics of included articles are presented in Table 1. All studies were prospective in design apart from one retrospective study [30]. Ten of the 13 articles were from North America and Europe and compared more than two risk assessment tools based on a 5-year risk prediction interval. Only two articles presented findings for 5-and 10-year tool-determined risk [31,32]. The tools assessed included data from questionnaires, with or without information on mammographic breast density and PRS. The number of risk predictors varied between tools, from as few as five (e.g., Chen version 1) [41] to as many as 13 (e.g., Tyrer-Cuzick version 8.0b [31] although it should be noted that some studies did not have data for all the risk predictors specified by the tool they assessed. Risk predictors considered in each tool by each study are presented in Supplementary Table S2. The Breast Cancer Risk Assessment Tool (BCRAT; also known as the Gail model), was the most frequently assessed tool in publications assessed (9 of 13 articles). This was followed by the Tyrer-Cuzick tool (also known as the IBIS risk assessment tool) in 5 articles, and BRCAPRO and iCARE-Lit in 3 articles each.

Selection of Articles and Summary Characteristics
Two articles evaluated the effect of adding breast density data: McCarthy et al. [32] compared 5-year risk using Tyrer-Cuzick version 7 versus version 8.0b which had breast density incorporated within the tool and Brentnall et al. [38] assessed 10-year risk using Tyrer-Cuzick version 7.0 with and without breast density data. In two more articles, tools with integrated breast density data (Chen version 1; Tyrer-Cuzick version 8.0b) were compared to other tools; in Choudhury et al., 2020 [35] Tyrer-Cuzick version 8.0b was compared to iCARE tool variants, and in Arrospide et al. [41] Chen version 1 was compared to BCRAT version 1.
Only one study assessed the effect of PRS data on existing tools; in Hurson et al. [29] a 313-variant polygenic score was added to two iCARE risk assessment tools (iCARE-Lit and iCARE-BPC3).
Evidence was available to compare tools in terms of risk of invasive breast cancer, however, evidence was sparse for in situ breast cancer incidence, while no data was available on breast cancer incidence according to prognostic indicators (e.g., tumour subtype, grade, size, nodal). Therefore, these outcomes were not able to be assessed.

Goodness-of-Fit
Absolute risk calibration is shown for various tools and tool comparisons (along with observed rates of incident breast cancer) in Figure 2A-C and Supplementary Figure S1A-C. In terms of goodness-of-fit between estimated and observed outcomes, no risk assessment tool was identified as being consistently well-calibrated in multiple studies. As can be observed from Table 2, many tools showed good calibration in some but not all studies: namely AABCS [32,40], BCRAT version 3 [30,35,36], BCRAT version 4 [31,33,34], Tyrer-Cuzick version 8.0b [31,[33][34][35], iCARE-Lit and iCARE-BPC3 [29,35], and BRCAPRO version 2.1 [31,34]. In contrast, some tools did not demonstrate good calibration across studies; examples include BCRAT version 2 [32,40] and Tyrer-Cuzick version 7 [31,34]. There were other tools that were applied in single cohorts within this review, and thus could only be assessed in only one population and one setting. Of these, six showed a good fit (BCRAT version 1 [41]; Chen version 1 [41]; BCRmod [36]; BCRmod recalibrated [36]; KREA for women over 50 years [37]; KRKR [37]) and five showed evidence of miscalibration (i.e., p < 0.05) (BOADICEA [31]; ER- [39]; ER+ [39]; KREA for women under 50 years of age [37]; original Korean tool [40]; updated Korean tool [40]). To assist with comparison of studies, the x-axis shows the percentile distribution of groups being reported, with data points shown for the mid-points of each group. Red squares show the 'expected over observed' ratio for each risk group (with 95% confidence intervals shown), indicating calibration between expected and observed cancers at a risk group level. Blue circles show the corresponding observed rate of breast cancers within the study group, indicating the gradient of rates across the risk groups (expected to increase from left to right in accordance with increases in estimated breast cancer risk). Italic font indicates the risk tool being assessed, with the study cohort abbreviation also shown). * tools were calibrated to local population.
Combining breast density data with a tool score [38] or integrating breast density within a tool generating a new tool version [Tyrer-Cuzick version 7 vs. Tyrer-Cuzick version 8.0b [34]) did not improve the goodness of fit of the tool, with evidence of miscalibration in both cases.
Addition of PRS data, in the single study that evaluated a specific score [29], did not improve the goodness-of-fit of neither iCARE-Lit nor iCARE-BPC3, as assessed on different cohorts ([UK Biobank; Women's Genome Health Study (US)] (Supplementary Figure  S1). For these evaluations, there was evidence of miscalibration before and after addition of PRS information (p < 0.05). BCRAT modifications; (C) BCRAT vs. other risk assessment tools. Plots are then presented according to first author name. (The number of data points in each graph is determined by the number of risk groups that were reported in each study. To assist with comparison of studies, the x-axis shows the percentile distribution of groups being reported, with data points shown for the mid-points of each group. Red squares show the 'expected over observed' ratio for each risk group (with 95% confidence intervals shown), indicating calibration between expected and observed cancers at a risk group level. Blue circles show the corresponding observed rate of breast cancers within the study group, indicating the gradient of rates across the risk groups (expected to increase from left to right in accordance with increases in estimated breast cancer risk). Italic font indicates the risk tool being assessed, with the study cohort abbreviation also shown). * tools were calibrated to local population.
Combining breast density data with a tool score [38] or integrating breast density within a tool generating a new tool version [Tyrer-Cuzick version 7 vs. Tyrer-Cuzick version 8.0b [34]) did not improve the goodness of fit of the tool, with evidence of miscalibration in both cases.
Addition of PRS data, in the single study that evaluated a specific score [29], did not improve the goodness-of-fit of neither iCARE-Lit nor iCARE-BPC3, as assessed on different cohorts ([UK Biobank; Women's Genome Health Study (US)] (Supplementary Figure S1). For these evaluations, there was evidence of miscalibration before and after addition of PRS information (p < 0.05).
No change was observed in the calibration of most tools (BCRAT version 2; BCRAT version 4; BRCAPRO version 2.1; BOADICEA) for longer-term risks, with evidence of miscalibration for both 5-and 10-year risk. The only exception was the AABCS tool, for which the goodness-of-fit improved for 10-year risk [32].
The contribution of PRS to improving risk tool accuracy varied between tools and sub-groups in Hurson et al. [29]. For example, PRS improved the consistency of the graded association between risk groups and observed rates for the iCARE-Lit tool applied to a UK cohort of women aged under 50 years but did not improve the trend for women in that cohort aged 50 years or older, nor for a US cohort aged 50-74 years. Another iCARE tool variant of (iCARE-BPC3) worsened the graded association between risk groups and observe cancer rates for women aged 50 years or older in a UK cohort.
The addition of mammographic density appeared to improve some tools slightly for some risk groups. For example, the Tyrer-Cuzick tool reported in McCarthy et al. [34] improved differentiation for the higher-risk groups but worsened the graded association in lower-risk groups (Figure 2A), and Tyrer-Cuzick applied in Brentnall et al. [38] did not discernibly improve the association ( Figure S1C, Supplementary Materials).
There was limited evidence to evaluate the effect of a longer risk-prediction interval on observed cancer incidence. The AABCS tool appeared to better differentiate lower and higher-risk groups at 10 years than 5 years, [32,40] and the BCRAT version 2 tool was more clearly graded with longer-term cancer incidence ( Figure 2B; Supplementary Figure  S1C) [32]. It was not possible to evaluate the results from the risk assessment tools reported by Terry et al. [31] due to the uneven distribution of the cohort among the five risk groups reported.

Risk of Bias Assessment
Risk of bias assessments were undertaken for each of the tools evaluated in each study. The overall risk of bias rating for all 47 risk of bias assessments undertaken was high ( Table 3). The overall risk of bias for the participants domain was low for 75% of assessments. For the predictor domain, the overall rating was low for 36% of assessments, high for 36% and unclear for 28%; for the outcome domain, 66% of assessments were rated as unclear and 34% at high risk while the overall rating for the analysis domain was high risk of bias for all 47 assessments. Detailed findings listed per risk of bias domain are provided in the Supplementary Table S3.  Common factors that contributed to unclear or high risk of bias ratings included: handling of missing predictors at the time of validation when a tool did not allow for an unknown or missing option; specification of standard measurement of predictors at baseline; minimal reporting of predictor assessments blind to outcome; limited information provided around methods used to determine outcomes; omission of standard outcome definitions and standardised follow-up protocol and lack of clarity on the number of women who had full follow-up for the time interval between predictor assessment and outcome determination. Furthermore, the analysis domain rated poorly with all tools examining 5-year risk having <100 events across risk categories, although this was achieved for most tools assessing 10-year risk. Additionally, often no direct reference was made to baseline questionnaires preventing a clear assessment of the handling of continuous and categorical predictors (i.e., if data transformation was required between collection vs. input) unless stated in the text.
Tools tended to rate poorly for methods regarding handling of missing data. The main reasons for poor ratings included inappropriate assumptions, omitting predictors with missing data in general or for a particular predictor, and imputation of predictors with >50% missing participant data.

Summary of Main Results
This systematic review of studies comparing multiple breast cancer risk assessment tools within general populations examined several metrics to evaluate risk assessment tools, namely: the ratio of the expected over observed number of breast cancer cases; evidence of miscalibration; the proportion of study group where E/O and 95%CI includes unity, and how these related to the observed cancer incidence rates across assigned risk groups. We found that no tool was consistently well-calibrated across multiple studies, and breast density or polygenic risk scores did not improve calibration. While most tools identified a risk group with higher rates of observed cancers, few tools identified lower-risk groups across different settings.
We did not apply a single metric to compare tools because the interpretation and value of each metric depends on how the risk assessment tool might be used. Where risk assessment tools are being used to advise an individual woman about her estimated breast cancer risk, specified as, for example, her 5-year or lifetime risk, the tool should have demonstrated very good calibration of E/O rates within her population to ensure a sufficiently accurate estimate. Communication of this information is also important, as these estimates are often misinterpreted as individual level risk so that, for example, an estimated 3% five-year risk is interpreted as the individual woman having a 3% risk of breast cancer in the next five years, when instead it indicates that 3% of women in the risk group to which she belongs would be expected to have a breast cancer diagnosed in the next five years [42].
The individual-level risk estimates generated by risk tools are also used in clinical practice to advise and manage women according to a risk group assignment based on their estimated risk of breast cancer, without necessarily reporting the estimated individual breast cancer risk for each woman. For example, the Royal Australian College of General Practitioners (RACGP) guidelines define women at 'moderately higher' risk as those with a 1.5 to 3 times higher than average risk, and women at 'potentially high' risk as more than 3 times the average population risk and recommend management based on these risk categories such as screening frequency and/or referral to specific breast imaging surveillance tests, or referral to specialist high-risk services [43]. These tools generally rely on individual risk estimation as the basis for risk group allocation. For example, the iPrevent tool draws on either the IBIS or BOADECIA risk tool depending on an assessment of initial factors such as family history, then assigns the individual to a risk group following the RACGP guidelines. Each woman's risk relative to the average is defined by the ratio of her estimated residual lifetime risk (to age 80) and the average residual lifetime population risk for women of her age. In a validation study of over 15,000 Australian women, iPrevent demonstrated good calibration for women under 50 years (E/O: 1.04; 95% CI = 0.93 to 1.16) but poor calibration for women aged 50 years and older (E/O: 1.24; 95% CI = 1.11 to 1.39), largely due to overestimation of risk in the highest study group decile [44]. These findings are concerning in terms of providing accurate risk estimation to individual women however, as noted by Phillips et al., "the extent of overestimation is unlikely to be of clinical importance because the actual 10-year [breast cancer] risks for these women substantially exceed thresholds for intensified screening and medical prevention (and for mutation carriers, risk-reducing mastectomy). Therefore, the overestimation would be unlikely to lead to an inappropriate change in their clinical management." The issues mentioned above have potential consequences for how risk assessment tools should be evaluated in relation to risk-based population breast screening While GPs and specialists are (theoretically) able to refer an unlimited number of patients to services to which they are eligible, resource-constrained population risk-based screening programs would benefit from directing screening protocols for higher-risk (or lower-risk) clients to a priori proportions of the screening population. This could mean, for example, that a screening program would provide supplemental or alternative imaging tests to 10% of women deemed to be most likely to benefit from that imaging, based on their short-term breast cancer risk and the expected accuracy of their routine screening test (indicated by, for example, observed interval cancer rates). For this purpose, it should be sufficient to confirm that a risk tool can identify the 10% of screening clients for whom outcomes (observed rates of breast cancer and interval cancers) under the current approach to screening are significantly higher compared to clients with average outcomes in the screened population, even if that tool is not well calibrated in terms of expected and observed rates; this risk stratification could then be used to trial alternative approaches to screening. This is an important consideration because requiring good E/O calibration of risk assessment tools across the risk spectrum is a difficult standard to reach. For example, a recent evaluation of six established risk models (IBIS, BOADICEA, BRCAPRO, BRCAPRO-BCRAT, BCRAT, and iCARE-lit) in over 52,000 Australian women concluded that only one model (BOADICEA) calibrated well across the spectrum of 15-year risk [26].
Even where good E/O calibration is achieved, this does not necessarily mean that observed rates are ranked well or that calibration is good across the risk spectrum. For example, in the study by McCarthy and colleagues [34], despite BRCAPRO exhibiting goodness-of-fit for the cohort, the observed rates fluctuated for women in the middle deciles, and the assessment of the KRKR tool by Jee et al. [37] on women aged 50 years or older demonstrated good calibration overall but was well-validated for only 30% of the risk groups (of note, this metric is more stringent for studies with a larger number of risk groups such as Jee et al., which had ten). Conversely, models with evidence of miscalibration can demonstrate good differentiation of a higher-risk group. For example, in the study by Hüsing et. al., although BCRmod recalibrated showed evidence of miscalibration to the study population, higher-risk groups (deciles 9-10) were well differentiated [36].
Overall, despite differences between risk assessment tools and study cohorts, most risk tools were able to identify a group of women with the highest risk of breast cancer, with only a few exceptions (Chen v1, ER-, KREA, KRKR and the original Korean model). For lower-risk women, some tools assessed consistently stratified women in the lowest categories of breast cancer risk across different settings (e.g., Tyrer-Cuzick version 8.0b; BRCAPRO version 2.1; iCARE tools). In the case of BCRAT, this depended on the version used; i.e., BCRAT version 3 was found to be consistent in distinguishing women in the lowest risk group whereas the same was not observed for versions 2 and 1. Of note, for some tools it was not possible to assess this feature across different settings as there was only one relevant study included (e.g., BCRmod [36], KREA and KRKR [37], ER+ tool [39]).
The BCRAT tool was the most evaluated risk assessment tool in the included articles, followed by the Tyrer-Cuzick tool, with increasing evaluation of iCARE tools in more recent publications. The number of risk factors considered by the different tools varied considerably. This is an important consideration for policy-makers and health services when selecting the most suitable tool for a specific application, as the number of predictors and the level of detail required for each one can be an impost for women and requires substantial resources to ensure complete and accurate risk information is provided and recorded.
The number of risk groups varied greatly between studies (4-10 groups). Reporting results for more groups provides more detail on how the tool performs as a graded association with increasing risk, which is informative for population-level applications where the availability of resources might be limited. For example, isolating smaller groups of women with very high risk may be more feasible for targeting more costly options (such as MRI) to higher-risk women as part of population breast screening.
We found that mammographic breast density has not been shown to improve the accuracy of breast cancer risk assessment tools based on self-reported information collected from questionnaires. We did not review evidence on the accuracy of breast density alone as a risk assessment tool, with an equivalent assessment of whether other risk predictors improved the accuracy of breast density as a risk assessment tool. However, this is a very active research area, and ongoing review of high-quality evidence is warranted.
Similarly, we found that the addition of a PRS score did not improve accuracy when added to self-reported information within the tools assessed, although this finding was based on a single study [29]. We did not review evidence on the accuracy of PRS alone as a risk assessment tool.

Comparison with other Published Work
A number of other systematic reviews have been published previously in this field [45][46][47][48]. These aimed to provide an overview of published risk assessment tools, basing their assessments on (i) calibration performance using the E/O ratio and (ii) discriminatory accuracy using the area under the receiver operating curve (AUC) and/or concordance statistic (C-statistic). In this review we focused on studies that assessed more than one risk assessment tool on one or more populations, how those tools compared to each other and what overall observations could be drawn by assessing these studies collectively. For this purpose, the AUC and C-statistic are not considered the appropriate metrics for assessing discrimination as they measure the ability of a tool to determine which women are at higher or lower risk of breast cancer than average, but not whether women within a study population have been stratified according to their level of risk, which is critical when evaluating these tools for the purpose of population-based risk-based screening. We recommend the use of observed rates of incident breast cancer according to tool-determined risk groups as it provides a better quantitative assessment of discrimination for this purpose, informing consideration of interventions that might target women at different thresholds of risk across the risk spectrum.

Applicability and Model Performance
We observed that tools that were recalibrated to the risk profiles of the population in which they were applied demonstrated an improvement in fit, as exemplified in the study by Chay and colleagues [32], which compared BCRAT to its Asian-American variant. This improvement in a risk assessment tool highlights the importance of making such adjustments when considering the application of any risk tool, especially on specific populations. Tools are usually developed using breast cancer incidence rates and risk factor data collected from one population and then applied to a different population without adjusting these parameters. This can lead to poorer model performance as the distribution of risk factors and breast cancer incidence can vary across populations. We need, however, to distinguish between recalibration and 'pre-calibration' as exemplified by the iCARE-based tools which uniquely incorporated calibration to population-based age-specific disease incidence rates before they were used [35]. As can be seen from Table 2 and associated graphs, these tools generally performed very well. They fell within the scope of this systematic review as they met the review's criterion of a tool calibrated to the study validation population of interest. This approach in the use of risk-prediction tools seems sensible given that population-based age-specific disease incidence is usually available and, as reinforced by this review, tools without calibration perform very differently in different settings.
Assessment of the studies included, revealed opportunities to improve standardisation of risk tool evaluations. Not all studies cited the specific version of the tool and package used. When these details were not provided, it was difficult for reviewers to deduce this information even if predictors were listed. For example, one study [36] provided a link to the BCRAT tool on the National Cancer Institute (NCI) website and a second study [35] using the same tool, also included the date and year accessed. However, the NCI provides the latest tool versions without detailed history of previous versions and updates; therefore, the version of the tool used by these studies at the time they were conducted had to be deduced. For studies that cited tool versions, these were often determined by the software used; e.g., BCRAT can be run on SAS Macro or R and these packages have their own tool-version numbers. For some models, the software was accessible through different sources. For example, BRCAPRO is accessible via the BayesMendel R package or within the CancerGene software program which now uses the code from BayesMendel. In the case of studies using the latter, even when the CancerGene software version was cited there was insufficient information available from the CancerGene website to deduce which version of the BayesMendel R package was used by that software. For full transparency it is recommended that authors provide the specific version of the risk assessment tools used including the software package, all predictors offered by that version and used in the study being reported.

Risk of Bias and Quality of the Evidence
Critical assessment of studies in terms of risk of bias is required to provide a comprehensive evaluation. We used the recently published PROBAST tool, specifically designed to thoroughly assess the risk of bias in relation to risk assessment tool studies. Only one previous systematic review identified from our searches had included a risk of bias assessment, although a tool for evaluating modelling studies was used instead of PROBAST [42]. All tools we evaluated across studies received an overall rating of 'high risk of bias'. Although this was driven mainly by rulings for the domain of analysis, there was also an evident lack of clarity in the reporting of key details contributing to ratings of 'unclear risk of bias' for 28-66% of tools for the predictor and outcomes domains.
One of the main areas of concern is the domain of predictors with respect to the collection and completeness of data on risk predictors, and the statistical methods used to deal with any data issues. One method that studies used to deal with missing predictors at the time of validation was multiple imputation (29,35). Although this is a common method to deal with missing data, the reference dataset is simulated and thus possibly less reliable. This also limits our understanding of how missing data would be addressed at an individual level if the tool were utilised as part of health service provision. In other studies, researchers sometimes stated that missing data was handled according to the specifications of each software application (e.g., McCarthy et al. [34], Jantzen et al. [33]), however it was not always clear whether a predictor value was then classed as missing or whether the predictor was omitted from the tool (e.g., BRCAPRO version 2.1-3, Tyrer-Cuzick version 8.0b and BOADICEA v3 in Terry et al. [31]). In other cases, the approach to handling missing data was not reported. For example, Brentnall et al. [38] applied version 7.02 of the Tyrer-Cuzick model (developed in the UK) on a US cohort; this version included prior use of hormone replacement therapy (HRT) (yes/no) as a predictor without an option of selecting 'unknown' or 'missing' if these data were unavailable. The authors did not report any information regarding the collection of HRT data or how missing data was handled. Overall, it is not possible to evaluate the precise effect of missing predictor values on risk estimates unless provision of a 'missing' option has been made by tool providers, which may indeed be more reflective of actual use of tools in practice as sometimes information on predictors cannot be recalled. We recommend that future studies consider including information on how missing data are managed, as this would improve comparability between studies and help recognise the challenges of applying risk assessment tools to different settings and study populations. Overall, factors identified by our risk of bias analysis could potentially explain some of the observed differences in tool performance in different settings described throughout this review.
We also recommend more standardised and transparent reporting of risk assessment tools, using the 'Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis' (TRIPOD) statement published in 2015 [49]. TRIPOD provides a 22item checklist considered to be key for transparent reporting of risk assessment tool studies. The statement was created to increase the level of reporting standards as prior studies performing external validation of risk assessment tools were found to commonly lack clarity in reporting and tended not to present important details needed to understand how the tool might be applied or whether results reflected true performance of the tool [50]. This was reflected in a systematic review examining the methodological conduct and reporting of external validation studies for risk assessment tools that found that of 45 articles published in 2010, 16% did not report the number of outcome events when validating tools, 54% did not acknowledge missing data, and frequently, it was unclear as to whether the authors had applied a complete or an abridged version of the tool [50]. For our analysis, four studies [30,32,40,41] were published prior to the TRIPOD statement, however no studies published after 2015 refer to the TRIPOD statement or checklist.

Limitations
This systematic review has certain limitations. A number of studies that compared different risk assessment tools on the same population were not included due to the focus of this review to compare risk assessment tools generated from, or calibrated to, a different population to the study validation population of interest or tools specifically calibrated to the study population of interest. However, focusing on the selected studies in this review enabled a fairer comparison between tools and improvement in the quality of the evidence. Secondly, despite meeting the criteria for inclusion, some studies had to be excluded due to some required data being unavailable for full assessment. Nonetheless, efforts were made to contact the authors. Additionally, for studies which did not provide the number of women in each risk category, the calculated estimates may be inaccurate if numbers are distributed unequally between risk categories, as the number of women per category was estimated by dividing the numbers of participants equally among categories.
This review did not compare tools in terms of interval cancers (i.e., cancers diagnosed following a negative population screening test), breast cancer mortality, nor incidence of breast cancer defined by different tumour characteristics (e.g., sub-type, size, grade, nodal involvement). We did initially seek to assess these outcomes as this evidence is likely to be of interest for some applications, such as consideration of risk-based screening protocols, however insufficient evidence was available to make these comparisons between tools.
Finally, one of the methods used to assess risk assessment tools was based on E/O point estimates and their 95%CI including unity (E/O = 1). Studies where tools were applied on small cohorts of women have wider CIs and therefore be more likely to include this value compared to larger studies which have narrower CIs. Additionally, we characterised the functional form of observed cancer rates according to risk groups based on point estimates reported without uncertainty estimates (e.g., CIs). However, while we acknowledge these metrics have minor limitations, these were only two of the metrics employed; when evaluated collectively all metrics analysed provide sufficient information to enable a fair and balanced assessment of risk assessment tools.

Conclusions
This systematic review identified various questionnaire-based tools (sometimes incorporating mammographic density or genetic information) that are effective in assigning women to risk groups for incident breast cancer, for various metrics of tool performance. The most appropriate metrics to consider depend on how the risk tool is to be applied. While good calibration between expected and observed rates is essential for individual-level estimated breast cancer risk described as a rate over a specified period, tools demonstrating good differentiation of observed breast cancer incidence rates are potentially suitable for triaging women to population-level risk-based interventions such as risk-based breast cancer screening, even if they are not well calibrated in terms of expected versus observed outcomes across the risk spectrum. Current trials such as MyPeBS [51] and WISDOM [52] are allocating women to risk-based screening protocols based on their predicted risk of breast cancer as estimated by combining genetic information with scores from risk assessment tools which incorporate mammographic density (Tyrer-Cuzick and Breast Cancer Surveillance Consortium (BCSC) risk tools for MyPeBS [53]; BCSC risk tool for WIS-DOM [54]). Results from these studies will provide valuable information on the clinical utility of these detailed and resource-intensive risk assessment tools; in parallel, work is required to understand the relative utility of more parsimonious tools that may achieve similar outcomes while markedly reducing the impost of risk assessment on women and health services.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/cancers15041124/s1, Figure S1: additional graphs of risk calibration and observed rate of incident breast cancer for tools in included studies; Table S1: search strategy; Table S2: risk predictors within risk assessment tools compared in the studies included in the review; Table S3: detailed assessment of risk of bias of included risk assessment tools; List S1: articles excluded from the review at full text screening stage by reason for exclusion; Dataset S1: data extracted from included articles. Funding: This research was conducted as part of the Roadmap to Optimising Screening in Australia for Breast (ROSA) project which received funding from the Australian Government Department of Health (contract reference 2000004049). The funder had no role in the study design, data collection and analysis, the decision to publish or the preparation of the manuscript. The funder had no role in the study design, data collection and analysis, the decision to publish or the preparation of the manuscript.
Informed Consent Statement: Written informed consent was not required as this review used data from previously published studies.

Data Availability Statement:
The data supporting the findings of this work are available in the cited articles and in the manuscript's Supplementary Materials.